[GSProcessing] Add pre-computed categorical transformation loading #870
Merged
thvasilo
added
0.3
gsprocessing
For issues and PRs related to the GSProcessing library
labels
Jun 10, 2024
thvasilo
commented
Jun 12, 2024
graphstorm-processing/graphstorm_processing/distributed_executor.py
Outdated
thvasilo
commented
Jun 12, 2024
graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
thvasilo
commented
Jun 12, 2024
thvasilo
commented
Jun 12, 2024
.../graphstorm_processing/data_transformations/dist_transformations/base_dist_transformation.py
jalencato
reviewed
Jun 12, 2024
.../graphstorm_processing/data_transformations/dist_transformations/base_dist_transformation.py
...phstorm_processing/data_transformations/dist_transformations/dist_category_transformation.py
Outdated
classicsong
reviewed
Jun 14, 2024
.../graphstorm_processing/data_transformations/dist_transformations/base_dist_transformation.py
...phstorm_processing/data_transformations/dist_transformations/dist_category_transformation.py
graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
Any concerns left for this PR?
I am OK with the PR.
classicsong
approved these changes
Jun 17, 2024
jalencato
approved these changes
Jun 17, 2024
LGTM
thvasilo
added a commit
that referenced
this pull request
Jul 11, 2024
…latent bugs (#915)

*Issue #, if available:*

*Description of changes:*

* During the refactor in #870 we moved the loader to be a class variable of DistributedExecutor, but because the S3 path is not unit tested we missed one case, where the output is on S3 and the user requests repartition on the leader.
* This error was actually picked up by mypy. I also fixed some other potential issues and type annotations here.

### Testing

Pre-commit, unit tests, and one test SageMaker job all succeed. The S3 codepath can only be integration tested currently.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Labels
0.3
gsprocessing
For issues and PRs related to the GSProcessing library
ready
able to trigger the CI
Issue #, if available:
Description of changes:
To re-apply the categorical transformations created by the code in #857, we first build a mapping from each original string to its one-hot representation, read from the saved JSON file, and then use a UDF to apply the mapping(s) to the column(s).
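The mapping-then-lookup idea can be sketched in plain Python. This is only an illustration, not the code from this PR: the JSON layout, the `color` feature, the helper names, and the all-zero handling of unseen values are assumptions, and a plain list stands in for the Spark column that the UDF would process.

```python
import json

# Hypothetical saved representation: feature name -> ordered category labels.
# The real JSON written by GSProcessing may be laid out differently.
SAVED_JSON = '{"color": ["blue", "green", "red"]}'


def build_one_hot_mapping(labels):
    """Map each known category string to its one-hot vector."""
    size = len(labels)
    return {
        label: [1.0 if i == idx else 0.0 for i in range(size)]
        for idx, label in enumerate(labels)
    }


def make_lookup(mapping, size):
    """Return a function that plays the role of the UDF: one value in, one vector out."""

    def lookup(value):
        # Assumption: values not present in the saved mapping become all zeros.
        return mapping.get(value, [0.0] * size)

    return lookup


labels = json.loads(SAVED_JSON)["color"]
mapping = build_one_hot_mapping(labels)
to_one_hot = make_lookup(mapping, len(labels))

# New data to which the saved transformation is re-applied.
new_column = ["red", "blue", "purple"]
encoded = [to_one_hot(value) for value in new_column]
```

Because the mapping comes from the saved file rather than from the new data, a category keeps the same vector position across runs.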
The `DistributedTransformation` class, from which all transformation implementations inherit, gains a new function, `apply_precomputed_transformation`. When a pre-computed transformation JSON file exists in the input, and the feature is one of those listed in that file, we use this function to re-apply the existing transformation instead of creating a new one.

The default implementation of `apply_precomputed_transformation` logs a warning and applies a new transformation.

When we implement a pre-computed transform for a new transformation (e.g. numerical) we need to:

* Ensure `self.json_representation` is populated during the call to `apply()`. This ensures the transformation info will be saved in the output JSON.
* Implement the `apply_precomputed_transformation` function (as we did for `DistCategoryTransformation` here), so that it uses the dict loaded from the JSON file to re-apply the transformation to the new data.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
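The two steps listed in the description could look roughly like the following Spark-free sketch. Only the names `DistributedTransformation`, `DistCategoryTransformation`, `apply`, `apply_precomputed_transformation`, and `json_representation` come from this PR. The constructor, the method signatures, the dict-of-lists data format, and `ExampleNumericalTransformation` are assumptions made for illustration, and the real GSProcessing classes operate on Spark DataFrames.

```python
import logging

logger = logging.getLogger(__name__)


class DistributedTransformation:
    """Simplified stand-in for the base class; signatures here are assumed."""

    def __init__(self, cols):
        self.cols = cols
        self.json_representation = {}

    def apply(self, data):
        raise NotImplementedError

    def apply_precomputed_transformation(self, data, precomputed):
        # Default described in the PR: log a warning, then create a new transformation.
        logger.warning(
            "%s has no pre-computed implementation, creating a new transformation",
            type(self).__name__,
        )
        return self.apply(data)


class DistCategoryTransformation(DistributedTransformation):
    """Toy one-hot encoder over a dict of column name -> list of strings."""

    @staticmethod
    def _encode(value, labels):
        return [1.0 if value == label else 0.0 for label in labels]

    def apply(self, data):
        out = {}
        for col in self.cols:
            labels = sorted({v for v in data[col] if v is not None})
            # Step 1: record what is needed to re-apply the transformation later.
            self.json_representation[col] = labels
            out[col] = [self._encode(v, labels) for v in data[col]]
        return out

    def apply_precomputed_transformation(self, data, precomputed):
        # Step 2: reuse the saved labels instead of deriving them from the new data.
        return {
            col: [self._encode(v, precomputed[col]) for v in data[col]]
            for col in self.cols
        }


class ExampleNumericalTransformation(DistributedTransformation):
    """Hypothetical transformation that has not implemented the pre-computed path."""

    def apply(self, data):
        return {col: [float(v) for v in data[col]] for col in self.cols}


# First run: compute the transformation and keep its representation.
first = DistCategoryTransformation(["color"])
first.apply({"color": ["red", "blue", "red"]})
saved = first.json_representation

# Later run: re-apply the saved transformation to new data.
reapplied = DistCategoryTransformation(["color"]).apply_precomputed_transformation(
    {"color": ["blue", "green"]}, saved
)

# A transformation without its own implementation falls back to apply().
fallback = ExampleNumericalTransformation(["x"]).apply_precomputed_transformation(
    {"x": [1, 2]}, {}
)
```

In this sketch the fallback path warns rather than fails, matching the default behavior the description gives for transformations that have not yet implemented the pre-computed path.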